# Multimodal dialogue

## Spatial LLaVA 7B GGUF
Apache-2.0 · rogerxi · 252 downloads · 1 like
Spatial-LLaVA-7B is a multimodal model fine-tuned from LLaVA to improve spatial-relationship reasoning, suited to multimodal research and chatbot development.
Tags: Text-to-Image, Safetensors

## Qwen3 8B NEO Imatrix Max GGUF
Apache-2.0 · DavidAU · 178 downloads · 1 like
A NEO Imatrix quantization of the Qwen3-8B model with a 32K context window and enhanced reasoning ability; a loading sketch follows this entry.
Tags: Large Language Model

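As a minimal illustration of running a GGUF build such as the one above with its full 32K context, the sketch below uses the llama-cpp-python bindings. The file name and prompt are placeholders rather than values taken from the listing, and the quantization file you actually download may be named differently.

```python
# Minimal sketch, assuming llama-cpp-python is installed and the GGUF file has
# been downloaded locally; the file name below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-8B-NEO-Imatrix-Max.Q4_K_M.gguf",  # placeholder file name
    n_ctx=32768,      # request the full 32K context window advertised for this build
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Explain importance-matrix (imatrix) quantization in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```
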
## VL Rethinker 72B MLX 4-bit
Apache-2.0 · TheCluster · 14 downloads · 0 likes
A 4-bit quantization of VL-Rethinker-72B, optimized for Apple devices using the MLX framework and supporting visual question answering tasks.
Tags: Text-to-Image, English

## Gemma 3 12B It GPTQ 4b 128g
ISTA-DASLab · 1,175 downloads · 2 likes
An INT4 quantization of google/gemma-3-12b-it produced with the GPTQ algorithm, reducing weights from 16-bit to 4-bit and significantly cutting disk-space and GPU-memory requirements; see the loading sketch after this entry.
Tags: Image-to-Text, Transformers

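A hedged sketch of how a GPTQ-quantized image-text checkpoint like this one is typically loaded through Hugging Face transformers. It assumes a recent transformers release with Gemma 3 support and a GPTQ backend (e.g. gptqmodel) installed; the repository id, image URL, and prompt are illustrative placeholders, not values confirmed by the listing.

```python
# Sketch only: repo id, image URL, and prompt are placeholders.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g"  # assumed repository id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(generated[0], skip_special_tokens=True))
```

Since the weights are stored in 4-bit form, the weight files take roughly a quarter of the disk space of the 16-bit original, which is the saving the entry refers to.
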
## Qwen2.5 VL 7B Instruct GPTQ Int4
Apache-2.0 · hfl · 872 downloads · 3 likes
Qwen2.5-VL-7B-Instruct-GPTQ-Int4 is an unofficial GPTQ Int4 quantization of the Qwen2.5-VL-7B-Instruct model, supporting multimodal image-text-to-text tasks.
Tags: Image-to-Text, Transformers, Supports Multiple Languages

## Llama3.1 Typhoon2 Audio 8B Instruct
scb10x · 664 downloads · 9 likes
Typhoon 2 Audio Edition is an end-to-end speech-to-speech model architecture that processes audio, speech, and text inputs and can generate text and speech outputs simultaneously. The model is optimized for Thai and also supports English.
Tags: Text-to-Audio, Transformers, Supports Multiple Languages

## ChatRex 7B
IDEA-Research · 825 downloads · 14 likes
ChatRex is a perception-focused multimodal large language model that can tie its answers to specific objects in the image while responding to questions.
Tags: Image-to-Text, English

## GLM Edge V 5B
Other license · THUDM · 4,357 downloads · 12 likes
GLM-Edge-V-5B is a 5-billion-parameter multimodal model that supports image and text inputs and can perform image understanding and text generation tasks.
Tags: Image-to-Text

## GLM Edge V 2B
Other license · THUDM · 23.43k downloads · 11 likes
GLM-Edge-V-2B is an image-text-to-text model built on the PyTorch framework, with support for Chinese.
Tags: Image-to-Text

## MMDuet
MIT · wangyueqian · 69 downloads · 4 likes
MMDuet is a VideoLLM that supports real-time interaction during video playback, focusing on time-sensitive video understanding tasks.
Tags: Video-to-Text, English

## Aria Sequential MLP FP8 Dynamic
Apache-2.0 · leon-se · 94 downloads · 6 likes
An FP8 dynamically quantized model based on Aria-sequential_mlp, suitable for image-text-to-text tasks and requiring approximately 30 GB of VRAM.
Tags: Image-to-Text, Transformers

## Qwen2 VL Tiny Random
yujiepan · 27 downloads · 1 like
A small, randomly initialized debugging model built from the Qwen2-VL-7B-Instruct configuration, useful for smoke-testing vision-language pipelines before the full model is loaded; see the sketch after this entry.
Tags: Image-to-Text, Transformers

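A hedged sketch of the usual workflow for a tiny random checkpoint like this: point the pipeline at the small model first so that template, shape, and generation bugs surface in seconds, then swap in the real 7B weights. The repository ids are illustrative assumptions, and the class names assume a transformers version with Qwen2-VL support.

```python
# Sketch: verify that the processor and model code paths run against the tiny
# random checkpoint before downloading the full 7B model. Repo ids are assumed.
from transformers import AutoConfig, AutoProcessor, Qwen2VLForConditionalGeneration

DEBUG = True  # flip to False once the pipeline works end to end
model_id = "yujiepan/qwen2-vl-tiny-random" if DEBUG else "Qwen/Qwen2-VL-7B-Instruct"

config = AutoConfig.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id)

# The tiny model keeps the Qwen2-VL architecture but with far fewer parameters,
# so loading and a forward pass are cheap; its outputs are meaningless by design.
print(config.model_type, sum(p.numel() for p in model.parameters()), "parameters")
```
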
## InternVideo2 Chat 8B HD
MIT · OpenGVLab · 190 downloads · 16 likes
InternVideo2-Chat-8B-HD is a video understanding model that combines a large language model with VideoBLIP; it is built through a progressive learning scheme and can handle high-definition video input.
Tags: Video-to-Text, Safetensors

## LLaVA LLaMA 2 13B Chat Lightning Preview
liuhaotian · 2,122 downloads · 46 likes
LLaVA is an open-source multimodal chatbot based on the Transformer architecture, obtained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
Tags: Text-to-Image, Transformers

## BLIP-2 OPT 2.7B 8-bit
MIT · Mediocreatmybest · 69 downloads · 2 likes
BLIP-2 is a vision-language pre-trained model that combines an image encoder and a large language model for image-to-text generation tasks.
Tags: Image-to-Text, Transformers, English

## BLIP-2 Image To Text
MIT · paragon-AI · 343 downloads · 27 likes
BLIP-2 is a vision-language pre-trained model that bootstraps language-image pre-training from a frozen image encoder and a frozen large language model; a captioning sketch follows this entry.
Tags: Image-to-Text, Transformers, English

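For either of the two BLIP-2 entries above, plain image captioning with the transformers BLIP-2 classes looks roughly like the sketch below. The checkpoint id shown is the reference Salesforce release rather than the listed repositories, and the image URL is a placeholder.

```python
# Sketch: BLIP-2 image captioning via transformers (also needs Pillow and requests).
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"  # reference checkpoint; swap in a listed repo id
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

url = "https://example.com/photo.jpg"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# With no text prompt, the model produces an unconditional caption for the image.
inputs = processor(images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```
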
## MiniGPT-4 LLaMA 7B
wangrongsheng · 1,777 downloads · 18 likes
MiniGPT-4 is a multimodal model that combines visual and language capabilities, developed on top of the Vicuna language model.
Tags: Text-to-Image, Transformers

## LLaVA 13B v0 4-bit 128g
wojtab · 167 downloads · 79 likes
LLaVA is a multimodal model combining vision and language, based on the LLaMA architecture and supporting image understanding and dialogue generation.
Tags: Text-to-Image, Transformers